Classical Chinese Sentence Segmentation

نویسندگان

  • Hen-Hsen Huang
  • Chuen-Tsai Sun
  • Hsin-Hsi Chen
چکیده

Sentence segmentation is a fundamental issue in Classical Chinese language processing. To facilitate reading and processing of the raw Classical Chinese data, we propose a statistical method to split unstructured Classical Chinese text into smaller pieces such as sentences and clauses. The segmenter based on the conditional random field (CRF) model is tested under different tagging schemes and various features including n-gram, jump, word class, and phonetic information. We evaluated our method on four datasets from several eras (i.e., from the 5th century BCE to the 19th century). Our CRF segmenter achieves an F-score of 83.34% and can be applied on a variety of data from different eras.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Chinese sentence segmentation as comma classification

We describe a method for disambiguating Chinese commas that is central to Chinese sentence segmentation. Chinese sentence segmentation is viewed as the detection of loosely coordinated clauses separated by commas. Trained and tested on data derived from the Chinese Treebank, our model achieves a classification accuracy of close to 90% overall, which translates to an F1 score of 70% for detectin...

متن کامل

Segmentation of Chinese Long Sentences Using Commas

The comma is the most common form of punctuation. As such, it may have the greatest effect on the syntactic analysis of a sentence. As an isolate language, Chinese sentences have fewer cues for parsing. The clues for segmentation of a long Chinese sentence are even fewer. However, the average frequency of comma usage in Chinese is higher than other languages. The comma plays an important role i...

متن کامل

A Segmentation Matrix Method for Chinese Segmentation Ambiguity Analysis

Chinese Segmentation Ambiguity (CSA) is a fundamental problem confronted when processing Chinese language, where a sentence can generate more than one segmentation paths. Two techniques are commonly used to identify CSA: Omni-segmentation and Bi-directional Maximum Matching (BiMM). Due to the high computational complexity, Omni-segmentation is difficult to be applied for big data. BiMM is easie...

متن کامل

Word Segmentation as Graph Partition

We propose a new approach to the Chinese word segmentation problem that considers the sentence as an undirected graph, whose nodes are the characters. One can use various techniques to compute the edge weights that measure the connection strength between characters. Spectral graph partition algorithms are used to group the characters and achieve word segmentation. We follow the graph partition ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010